This notebook details the process of analysing tweets by train company accounts.
We are interested in the following questions:
First we need to import the data, which is over 100,000 tweets. There are a few ways of doing this: by using the link to the latest CSV, online…
…or by using the API endpoint which provides it in JSON:
You can also use this to get the results of SQL queries:
countbyaccount <- jsonlite::fromJSON("https://premium.scraperwiki.com/3epab2y/iyjilhnvxo69dnp/sql/?q=%20select%20count(*)%2C%20account%0A%20from%20traincompanytweets%20%0A%20group%20by%20account%0A%20order%20by%20count(*)%20desc")
countbyaccount
…Or by importing a local file (the disadvantage of this is that it is not a live link to the latest data):
#Store the filename
csvfile = "traincompanytweets.csv"
tweets <- read.csv(csvfile)
Next some overview of accounts:
table(tweets$account)
ArrivaTW c2c_Rail chilternrailway Elizabethline
306 3232 4199 3900 638
EMTrains ga_mainline ga_regional ga_westanglia GNRailUK
4525 3660 3654 3673 4665
greateranglia GWRHelp HeathrowExpress LDNOverground LNER
6034 6001 3219 3228 3208
LNRailway merseyrail nationalrailenq networkrail northernassist
3224 3204 3243 3221 3233
NRE_TfLRail ScotRail Se_Railway SouthernRailUK Stansted_Exp
1919 3206 3219 5329 4081
SW_Help SW_Railway TfLRail TLRailUK tlupdates
3225 3220 3245 5054 5066
TPEAssist VirginTrains WestMidRailway
3214 3243 3230
And a summary of columns:
summary(tweets)
tweetdate
Fri Nov 23 06:29:33 +0000 2018: 19
Mon Oct 29 08:56:24 +0000 2018: 11
Mon Oct 08 09:46:53 +0000 2018: 8
Thu Nov 01 11:44:43 +0000 2018: 8
Fri Nov 23 11:12:35 +0000 2018: 7
Sun Nov 25 05:59:57 +0000 2018: 7
(Other) :117458
tweettxt
"Services are running on time and our team will be here to help from 7am (Mon-Fri) or 8am (Sat-Sun) #hexupdates" : 123
"Heathrow Express services are running on time #hexupdates" : 68
"Travelling home with us? Please check before you travel for live updates on your journey - https://t.co/NWZnb59sUi" : 47
"Services are running on time this morning and our team will be here to help from 7am (Mon-Fri) or 8am (Sat-Sun) #hexupdates" : 40
"RT @networkrail: For any emergencies on the railway 24 hours a day call our helpline on 03457 11 41 41 or @BTP on 0800 40 50 40 / emergency\\u2026": 29
"If you've been delayed on our services in excess of 30 mins today, you can find details of how to claim at https://t.co/8mjQWiz2qf" : 27
(Other) :117184
tweeturl
https://twitter.com/_____eleanor/status/1047394623679864833: 1
https://twitter.com/____LP____/status/1063138290029789185 : 1
https://twitter.com/___al94/status/1074970095049412608 : 1
https://twitter.com/___EmmaLou/status/1065522972004360197 : 1
https://twitter.com/___jcx/status/1064481100544843776 : 1
https://twitter.com/___megss/status/1066080101115281411 : 1
(Other) :117512
name tweetid screenname
GA West Anglia : 3666 Min. :5.960e+17 ga_westanglia: 3666
GA Regional : 3653 1st Qu.:1.053e+18 ga_regional : 3653
GA Mainline : 3650 Median :1.063e+18 ga_mainline : 3650
Stansted Express: 3292 Mean :1.053e+18 Stansted_Exp : 3293
Greater Anglia : 2082 3rd Qu.:1.067e+18 greateranglia: 2144
NRE TfLRail : 1804 Max. :1.075e+18 NRE_TfLRail : 1804
(Other) :99371 (Other) :99308
account
greateranglia : 6034
GWRHelp : 6001
SouthernRailUK: 5329
tlupdates : 5066
TLRailUK : 5054
GNRailUK : 4665
(Other) :85369
Some accounts publish too infrequently and need to be removed:
#Remove tweets by two accounts:
tweets <- subset(tweets, tweets$account != "Elizabethline")
tweets <- subset(tweets, tweets$account != "NRE_TfLRail")
Let’s check it again:
sqldf::sqldf("SELECT count(*) as tweets, account
FROM tweets
GROUP BY account
ORDER BY tweets")
That’s good: we only have accounts with more than 3000 tweets. What about those with no account name?
sqldf::sqldf("SELECT *
FROM tweets
WHERE account = ''
LIMIT 10")